Effective Data Visualization
Steve Elston
01/10/2022
Solving Data Science Coding Problems
Correctly using programming environments (Python, R, SQL, …) is a core data science skill.
- Read the documentation!
- Package documentation should be your first stop
- Do you know what all the arguments do?
- Are there examples/User Guides?
- Search for answers online
- Search package Wiki
- Search reliable third party sources - e.g. StackOverflow
- Chances are some package ninja has already answered the question you have
- Trial and error
- Sometimes you just have to test several possibilities
- Work with the smallest unit of code to reproduce the problem
- Use a small subset of the data - check types and formats
- Try a symbolic debugger
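As a small illustration of the "smallest unit" and "check types" advice, the sketch below takes a few rows and reports which Python types occur in each column. The records and the helper name `column_types` are made up for this example.

```python
from collections import defaultdict


def column_types(rows):
    """Map each column name to the set of value type names seen in rows."""
    seen = defaultdict(set)
    for row in rows:
        for name, value in row.items():
            seen[name].add(type(value).__name__)
    return dict(seen)


# Hypothetical records; the string "36" stands in for a typical parsing slip.
rows = [
    {"year": 1700, "count": 5.0},
    {"year": 1701, "count": 11.0},
    {"year": 1702, "count": "36"},
]

subset = rows[:3]  # work with the smallest slice that reproduces the problem
print(column_types(subset))
```

A column that reports more than one type is usually the first place to look.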
Effective Visualization for Exploration and Communications
Visualization is primarily a form of communication
- To be effective, visualization must be:
- Clear
- Well Organized
- Simple
- Creating effective visualization is difficult
- Requires significant effort
- Try lots of ideas, fail fast, keep successful results
Visualizing Large Complex Data is Difficult
Problem: Modern data sets are growing in size and complexity
Goal: Understand key relationships in large complex data sets
Difficulty: Large data volume
- Modern computational systems have massive capacity
- Example: Use map-reduce algorithms on cloud clusters
Difficulty: Large numbers of variables
- Huge number of variables with many potential relationships
- Dealing with complexity is the hard part!
Limitations of Scientific Graphics
All scientific graphics are limited to a 2-dimensional projection
But, complex data sets have a great many dimensions
We need methods to project large complex data onto 2 dimensions
Generally, multiple views are required to understand complex data sets
- Don’t expect one view to show all important relationships
- Develop understanding over many views
- Try many views, don’t expect most to be very useful
Why is Perception Important?
Goal: Communicate information visually
Visualization techniques should maximize the information a viewer perceives
Limits of human perception are a significant factor in understanding complex relationships
We can apply the results of the considerable research on human perception to data visualization
Use Aesthetics to Improve Perception
We explore aesthetics to improve perception
We take a very broad view of the term ‘aesthetic’ here
A plot aesthetic is any property of a visualization that highlights some aspect of the data relationships
Aesthetics are used to project additional dimensions of complex data onto the 2-dimensional plot surface
Over-plotting
Over-plotting occurs in plots when the markers lie one on another.
- Common, even in relatively small data sets
- Scatter plots can look like a blob and be completely uninterpretable
- Over-plotting is a significant problem in EDA and presentation graphics
Dealing with Over-plotting
What can we do about over-plotting?
Marker transparency: so one can see markers underneath; useful in cases with minimal overlap of markers
Marker size: smaller marker size reduces over-plotting within limits
Adding jitter: adding a bit of random noise to variables with a limited number of distinct values
Down sample: visualize a subset of the full data
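The four remedies above can be sketched in a few lines. This is a minimal sketch, not a recommended recipe: the helper names, the noise amount, and the synthetic data are all assumptions made for illustration. The plotting helper imports matplotlib lazily, so the data-preparation functions run without it.

```python
import random


def jitter(values, amount, seed=0):
    """Add uniform noise in [-amount, amount] to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


def down_sample(points, k, seed=0):
    """Return a random subset of at most k points, without replacement."""
    rng = random.Random(seed)
    return rng.sample(points, min(k, len(points)))


def plot_points(points, alpha=0.2, size=4):
    """Scatter plot with transparent, small markers (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    xs, ys = zip(*points)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys, alpha=alpha, s=size)  # s is the marker area in points^2
    return fig


# Hypothetical data: x takes only five distinct values, so markers stack up.
rng = random.Random(1)
xs = [rng.randint(1, 5) for _ in range(10_000)]
ys = [x + rng.gauss(0, 1) for x in xs]

# Jitter the discrete variable, then keep a subset of 1,000 points.
points = down_sample(list(zip(jitter(xs, 0.3), ys)), k=1_000)
```

Calling `plot_points(points)` then draws the reduced, jittered data with transparency and small markers.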
Example of Overplotting

Use Transparency, Marker Size, Downsampling

Alternatives to avoid over-plotting for truly large data sets
- Hex bin plots: the 2-dimensional equivalent of the histogram
- Frequency of values is tabulated into 2-dimensional hexagonal bins
- Displayed using a sequential color palette
- 2-d kernel density estimation plots: natural extension of the 1-dimensional KDE plot
- Good for moderately large data
- Heat map: values of one variable against another
- Categorical (count) or continuous variables
- Carefully choose the color palette, sequential or divergent
- Mosaic plots: display multidimensional count (categorical) data
- Uses tile size and color to project multiple dimensions
- 2-d equivalent of a multi-variate bar chart
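The idea behind binned plots is to tabulate counts per 2-dimensional cell and then color each cell by its count. The sketch below uses square bins rather than hexagons, purely to keep the arithmetic short. The function name and the synthetic data are assumptions for illustration.

```python
import random
from collections import Counter


def bin_counts_2d(xs, ys, bin_width):
    """Count points per square bin; keys are (column, row) bin indices."""
    counts = Counter()
    for x, y in zip(xs, ys):
        counts[(int(x // bin_width), int(y // bin_width))] += 1
    return counts


# Hypothetical data: 50,000 points would be a solid blob in a scatter plot.
rng = random.Random(0)
xs = [rng.gauss(0, 1) for _ in range(50_000)]
ys = [rng.gauss(0, 1) for _ in range(50_000)]

counts = bin_counts_2d(xs, ys, bin_width=0.5)
densest_bin, n = counts.most_common(1)[0]
print("bins used:", len(counts), "densest bin:", densest_bin, "count:", n)
```

The resulting counts are what a sequential color palette would then encode, one color per bin.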
Hexbin Plot

Contour Plot

Other Methods to Display Large Data Sets
Sometimes a creative alternative is best
Often situation specific; many possibilities
Finding a good one can require significant creativity!
Example, choropleth for multidimensional geographic data
Example, time series of box plots
Time Series of Box Plots

bivariate measures
Pearson’s correlation looks for a linear relationship
Spearman’s rank correlation is Pearson’s correlation applied to ranks (min = rank 1, max = rank \(n\))
image source: wikipedia.com
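To make the difference between the two measures concrete, here is a minimal pure-Python sketch. The function names are illustrative, and ties are given their average rank, which is one common convention and an assumption here. On a monotonic but non-linear relationship, Spearman's correlation is 1 while Pearson's is lower.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 (min) to n (max); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's correlation applied to ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]  # monotonic, but not linear
print("Pearson: ", round(pearson(xs, ys), 3))
print("Spearman:", round(spearman(xs, ys), 3))
```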
joint plots
- this is what a 2D density plot looks like, similar to a heatmap or contour plot
- we can also imagine a 2D histogram, but visually it’s not practical
- they can resemble theoretical distributions, like the bivariate normal distribution
- image source: seaborn.pydata.org/
Organization of Plot Aesthetics
We can organize aesthetics by their effectiveness:
Easy to perceive plot aesthetics: help most people gain understanding of data relationships
Aesthetics with moderate perceptive power: useful properties to project data relationships when used sparingly
Aesthetics with limited perceptive power: useful within strict limits
Properties of Common Aesthetics
| Aesthetic | Perception | Data types |
|---|---|---|
| Aspect ratio | Good | Numeric |
| Regression lines | Good | Numeric plus categorical |
| Marker position | Good | Numeric |
| Bar length | Good | Counts, numeric |
| Sequential color palette | Moderate | Numeric, ordered categorical |
| Marker size | Moderate | Numeric, ordered categorical |
| Line types | Limited | Categorical |
| Qualitative color palette | Limited | Categorical |
| Marker shape | Limited | Categorical |
| Area | Limited | Numeric or categorical |
| Angle | Limited | Numeric |
Aspect Ratio
- Aspect ratio has a significant influence on how a viewer perceives a chart
- Correct aspect ratio can help highlight important relationships in complex data sets
- We express aspect ratio as follows:
\[aspect\ ratio = \frac{width}{height}\ : 1\]
- Banking angle is key to understanding how the aspect ratio affects perception
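The slides do not say which banking rule is intended. The sketch below assumes the median-absolute-slope heuristic, which picks the aspect ratio so that a typical line segment is drawn at about 45 degrees. With data ranges \(R_x\) and \(R_y\), a segment of slope \(s\) appears on screen with slope \(s \cdot \frac{height}{width} \cdot \frac{R_x}{R_y}\); setting the median of its absolute value to 1 gives \(\frac{width}{height} = \mathrm{median}|s| \cdot \frac{R_x}{R_y}\). The function name is hypothetical.

```python
from statistics import median


def banked_aspect_ratio(xs, ys):
    """Width/height ratio that draws the median absolute slope at 45 degrees.

    Assumes xs is ordered and that both xs and ys vary (non-zero ranges).
    """
    slopes = [
        abs((y1 - y0) / (x1 - x0))
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
        if x1 != x0
    ]
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    return median(slopes) * x_range / y_range


# Hypothetical sawtooth: steep segments call for a wide, short plot.
saw_x = [0, 1, 2, 3, 4]
saw_y = [0, 10, 0, 10, 0]
print(banked_aspect_ratio(saw_x, saw_y), ": 1")
```

A steep, rapidly oscillating series therefore gets a wide and short plot, which is why long cyclical series are often drawn this way.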
Example of Changing Aspect Ratio
The longest scientific time series is the sunspot count:
## YEAR SUNACTIVITY
## 0 1700.0 5.0
## 1 1701.0 11.0
## 2 1702.0 16.0
## 3 1703.0 23.0
## 4 1704.0 36.0
Example of Changing Aspect Ratio
- Example uses data from 1700 to 1980
- Can you perceive the asymmetry in these sunspot cycles?

Example of Changing Aspect Ratio
- Notice how changing the aspect ratio changes the perception of the asymmetry?

Sequential and Divergent Color Palettes
Use of color as an aesthetic in visualization is a complicated subject.
- Color is often used, and often abused
- A qualitative palette is a palette of individual colors to display categorical values
- Sequential palettes and divergent palettes are a sequence of colors used to display a quantitative variable or ordered categorical variable
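A rough sketch of how the two quantitative palette types map a value to a color follows. It interpolates linearly in RGB, which is a simplification: palettes shipped with plotting libraries are typically designed more carefully. The endpoint colors and function names are arbitrary choices for illustration.

```python
def lerp_color(c0, c1, t):
    """Linearly interpolate between two RGB tuples for t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))


def sequential_color(value, vmin, vmax, low=(255, 255, 255), high=(0, 0, 139)):
    """Map value in [vmin, vmax] onto a single light-to-dark ramp."""
    t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp values outside the range
    return lerp_color(low, high, t)


def divergent_color(value, vmin, center, vmax,
                    low=(178, 24, 43), mid=(247, 247, 247), high=(33, 102, 172)):
    """Map value onto two ramps that meet at a neutral color at `center`."""
    if value <= center:
        return sequential_color(value, vmin, center, low, mid)
    return sequential_color(value, center, vmax, mid, high)


print(sequential_color(7.5, 0, 10))
print(divergent_color(-0.5, -1, 0, 1))
```

A sequential palette suits values that run from low to high, while a divergent palette suits values with a meaningful midpoint, such as correlations centered on zero.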
Auto Weight by Sequential Color Palette

Limits of color
Regardless of the approach there are some significant limitations
- A significant number of people are color blind. Red-green color blindness is most common
- Even the best sequential or divergent palettes show only relative value of numeric variables
- Perception of exact numeric values is difficult, except in special cases
Marker Size
Marker size is a moderately effective aesthetic for quantitative variables
- Used properly, marker size can highlight important trends in complex data sets
- The viewer can generally perceive relative differences, but not actual values
- Small size differences are not perceptible
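One way to apply this aesthetic is to scale the variable onto a bounded range of marker areas, so the smallest marker stays visible and the largest does not swamp the plot. The sketch below is an assumption about how one might do that. The function name and the area bounds are illustrative, and the areas are in the units matplotlib's `scatter` expects for its `s` argument (points squared).

```python
def marker_areas(values, min_area=10.0, max_area=200.0):
    """Scale values linearly onto marker areas between min_area and max_area."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No variation: give every marker the same mid-range area.
        return [(min_area + max_area) / 2] * len(values)
    scale = (max_area - min_area) / (hi - lo)
    return [min_area + (v - lo) * scale for v in values]


# Hypothetical engine sizes
engine_sizes = [1, 2, 3]
print(marker_areas(engine_sizes))
```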
Engine Size by Marker Size and Price by Sequential Color Palette

Line Plots and Line Type
Line plots connect discrete, ordered, data points by a line
- Can use different line pattern types to differentiate data
- Only useful for a limited number of lines on one graph
- Too many similar line patterns on one plot lead to viewer confusion and poor perception
Limits of Line Type

Marker Shape
Marker shape is useful for displaying categorical relationships
- This aesthetic is only useful when two conditions are met:
- The number of categories is small
- Distinctive shapes are chosen for the markers
- Human perception limits the number of shapes humans can perceive well
Aspiration by Marker Shape

higher dimensional data
In higher dimensions, things get more challenging, but still manageable up to a certain point (usually 5 or so dimensions)
- we can use aesthetics to add additional dimensions to visualizations, but we quickly run out of elements
- we can use faceting to break up a plot into many, but having too many plots to look at can be overwhelming
As dimensionality goes up, we need to rely on more advanced methods, but as we learn later there’s no such thing as a free lunch
- run an ML algorithm for dimensionality reduction
- use visualizations such as t-SNE that are meant to deal with such situations
Aesthetics
- how many dimensions are represented in this plot?
- be careful not to overdo it with aesthetics
- image source: seaborn.pydata.org/
Facet plots
- how many dimensions are represented in this plot?
- faceting is helpful when we have categorical data with a handful of categories
- keep the comparisons to a minimum otherwise you risk overplotting
- image source: seaborn.pydata.org/
Facet Plot with Weather by Season

Correlation matrix
- this is a visualization of the correlation matrix as a heatmap
- it can be used to see which variables are correlated with which others
- image source: seaborn.pydata.org/
Scatter plot matrix
- this is the scatter plot version of the correlation matrix
- color-coding can be used to add a 3rd dimension
- image source: seaborn.pydata.org/
Simpson’s paradox
Simpson’s paradox gives rise to false associations
The data appear to show one trend overall
But a different trend appears when the data are grouped by another variable
image source: wikipedia.com
Simpson’s paradox
Simpson’s paradox arises from a latent variable
- Latent variable is ‘hidden’
- Not considered in an analysis
- Unobservable or data is unavailable
- Examples of unobservable data
- Someone’s intention; only observe actions or responses
- The presence of a disease; only observe symptoms
- The temperature of the surface of the sun; only observe spectra
Simpson’s paradox
With categorical data, Simpson’s paradox can occur when the relative size of the groups is different between the control and treatment
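A short numerical sketch shows the categorical case. The counts are the ones commonly quoted for the kidney-stone treatment example of the paradox and are used here purely as an illustration. Treatment A has the higher success rate within each stone-size group, yet the lower rate when the groups are pooled, because the group sizes differ sharply between the treatments.

```python
# (successes, trials) per treatment and stone size; illustrative counts
outcomes = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, trials):
    """Success rate for one group."""
    return successes / trials


def pooled_rate(groups):
    """Success rate after pooling all groups of one treatment."""
    successes = sum(s for s, _ in groups.values())
    trials = sum(t for _, t in groups.values())
    return successes / trials


for treatment, groups in outcomes.items():
    by_group = {g: round(rate(*st), 3) for g, st in groups.items()}
    print(treatment, by_group, "pooled:", round(pooled_rate(groups), 3))
```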
Anscombe’s quartet
\(x\) and \(y\) in the four data sets have nearly identical means, variances, correlations and trend lines \(y = a + bx\) if we use linear regression to find \(a\) and \(b\), yet the four scatter plots look completely different
image source: wikipedia.com
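The claim can be checked directly. The sketch below computes the summary statistics and the least-squares line for each of the four data sets, using the standard published values of the quartet. The `summary` helper is an illustrative name.

```python
from math import sqrt

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summary(xs, ys):
    """Means, sample variances, correlation, and least-squares line y = a + b x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return {
        "mean_x": mx, "mean_y": my,
        "var_x": sxx / (n - 1), "var_y": syy / (n - 1),
        "corr": sxy / sqrt(sxx * syy),
        "a": my - b * mx, "b": b,
    }


for name, (xs, ys) in QUARTET.items():
    print(name, {k: round(v, 2) for k, v in summary(xs, ys).items()})
```

The printed summaries agree to about two decimal places, which is why plotting the data remains necessary.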
Summary
We have explored these key points
- Proper use of plot aesthetics enables projection of multiple dimensions of complex data onto the 2-dimensional plot surface.
- All plot aesthetics have limitations which must be understood to use them effectively
- The effectiveness of a plot aesthetic varies with the type and the application